Patterns of rates of mortality in the Clinical Practice Research Datalink

The Clinical Practice Research Datalink (CPRD) is a widely used data resource that is representative in demographic profile and has accurate death recording, but it is unclear whether mortality rates within CPRD GOLD are similar to rates in the general population. Rates may additionally be affected by selection bias caused by the requirement that a cohort have a minimum lookback window, i.e. observation time prior to the start of at-risk follow-up. Standardised Mortality Ratios (SMRs) were calculated using published population reference rates from the Office for National Statistics (ONS), with Poisson regression contrasting rates in CPRD GOLD with ONS rates, stratified by age, calendar year and sex. An overall SMR was estimated, along with SMRs for cohorts with different lookback windows (1, 2, 5, 10 years). SMRs were also stratified by calendar year, length of follow-up and age group. Mortality rates in a random sample of 1 million CPRD GOLD patients were slightly lower than in the national population [SMR = 0.980 95% confidence interval (CI) (0.973, 0.987)]. Cohorts with observational lookback had SMRs below one [1 year of lookback; SMR = 0.905 (0.898, 0.912), 2 years; SMR = 0.881 (0.874, 0.888), 5 years; SMR = 0.849 (0.841, 0.857), 10 years; SMR = 0.837 (0.827, 0.847)]. Mortality rates in the first two years after patient entry into CPRD were higher than in the general population, while SMRs dropped below one thereafter. Mortality rates in CPRD, using simple entry requirements, are similar to rates seen in the English population. The requirement of at least a single year of lookback results in lower mortality rates compared to national estimates.


Introduction
Representing one of the world's largest primary care databases, the Clinical Practice Research Datalink (CPRD) contains anonymised patient-level data captured at consenting general practitioner (GP) practices throughout the United Kingdom. Covering approximately 7% of the UK population, CPRD contains information on demographics, clinical results, medication usage, hospital admission, referrals, registration details and death [1]. CPRD has been shown to be representative with respect to ethnicity, sufficiently accurate in its recording of death and comparable to other populations with regard to age and sex distribution [2][3][4]. A common focus of Electronic Health Record (EHR) research, including research using CPRD, is the effect of diseases on mortality, and it is therefore imperative to understand how mortality rates in a selected CPRD population compare with general population rates. Selecting cohorts on the requirement that individuals have been registered at a contributing GP practice for a specific length of time is commonplace within EHR research [5][6][7][8][9][10]. Sometimes referred to as research-quality follow-up, or a lookback window, this is an observation period prior to the start of a subject's at-risk follow-up, ending at a date often referred to as the index date. The lookback period may be used for the clinical assessment of a comorbid condition or diagnosis, or to identify medication history. The selection effect of these delayed-entry conditions on estimated mortality rates is unknown.
In order to assess mortality rates in CPRD and the effect of requiring a lookback window, Standardised Mortality Ratios (SMRs) were estimated over two time scales, calendar year and follow-up period, using CPRD data for the period 2000 to 2018.

CPRD cohort and patient timelines
The data used comprised CPRD GOLD patients deemed to have research-acceptable data, with data linkages to both the Office for National Statistics (ONS) for death registration data and Hospital Episode Statistics (HES) for secondary hospital admission data. These commonly applied data linkages reduce the geographical area of CPRD to only the English data contribution. A random sample of 1 million patients was taken without replacement from research-acceptable patients with data linkages to both HES and ONS, who were ≥18 years old and alive with CPRD follow-up after 1 January 2000. Details of the random sample and associated Stata code can be found in the S1 File. These entry criteria defined the cohort entry or index date, I(0), of our cohort, from which mortality follow-up started (Fig 1).
A composite start date, S, was defined for each patient as the latest of the date of registration at their GP practice (first or current registration date) and the date the practice data was deemed to be of research quality or "up-to-standard" [11]. An end date, E, was defined as the earliest of the practice's last data collection date, the patient's date of transfer out of their GP practice, the date of death and the administrative censoring date (Fig 1). Four sub-cohorts were selected to have a lookback window, W, of at least 1, 2, 5 or 10 years. For each, a new cohort index date, I(w), was defined, signifying the start of at-risk follow-up, where W ≥ w, w = 1, 2, 5, 10. For each new sub-cohort, those with a lookback window of <w years were omitted from the analysis. The at-risk period for each individual was the end date, E, minus the cohort index date, I(w) (in years), and a crude death rate was calculated for each sub-cohort as the number of deaths divided by the total person-time at-risk, expressed per 1000 person-years. A Charlson Comorbidity Index (CCI) [12] score was calculated per patient using comorbid conditions identified in HES in the 10 years prior to the cohort index date I(w) (baseline). The scores were classified into four groups: a CCI score at baseline of zero, one, two, and three or more.
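The timeline definitions above can be illustrated with a minimal Python sketch. It is not the authors' Stata code: the dictionary keys (`start`, `index`, `end`, `died`), the helper names and the 365.25-day year are assumptions made for illustration, and the index date I(w) is taken as given rather than derived.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length for converting days to years


def years_between(earlier, later):
    """Elapsed time between two dates, in years."""
    return (later - earlier).days / DAYS_PER_YEAR


def composite_start(registration, up_to_standard):
    """S: the latest of the registration date and the practice up-to-standard date."""
    return max(registration, up_to_standard)


def end_date(*candidates):
    """E: the earliest of the supplied end events; None marks an event that did not occur."""
    return min(d for d in candidates if d is not None)


def crude_rate_per_1000(patients, w):
    """Crude death rate per 1000 person-years among patients with lookback W >= w.

    patients: list of dicts with hypothetical keys start (S), index (I(w)),
    end (E) and died (0 or 1).
    """
    deaths, person_years = 0, 0.0
    for p in patients:
        if years_between(p["start"], p["index"]) < w:
            continue  # lookback shorter than w years: omitted from the sub-cohort
        person_years += years_between(p["index"], p["end"])
        deaths += p["died"]
    return 1000 * deaths / person_years
```

For example, `end_date(last_collection, transfer_out, death, date(2018, 12, 31))` returns whichever of the four events comes first, ignoring any that are `None`.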

Standardised mortality ratios
The SMR is an indirect standardisation measure giving an estimate of the relative increase or decrease in mortality in a study population compared to a reference population. It is calculated as the ratio of the observed number of deaths, $D = \sum_{i=1}^{N} d_i$, within the study cohort to the expected number of deaths in the reference population, $E$ (here denoting expected deaths rather than the end date), with $d_i = 1$ if individual $i$ dies and 0 otherwise; $i = 1, \ldots, N$. The expected number of deaths is defined as $E = \sum_k \lambda^*_k t_k$, where $\lambda^*_k$ is the mortality rate in the reference population for stratum $k$, defined by unique gender, age and calendar year combinations, and $t_k$ is the cohort's total time at-risk (measured in person-years) for that stratum. The reference mortality rates are obtained from national actuarial life-tables published by ONS [13]. These provide precise estimates of mortality rates in the reference population, utilising mid-year population estimates and recorded mortality counts. An estimate of the overall SMR is obtained by modelling the number of observed deaths in the cohort in stratum $k$, $d_k$, such that $d_k \sim \mathrm{Poisson}(E_k)$, where $E_k = \mathrm{E}[d_k] = \lambda_k t_k$ and $\lambda_k$ is the cohort mortality rate in stratum $k$. To incorporate the expected number of deaths we use Poisson regression with a log link and two offsets, $\log(t_k)$ and $\log(\lambda^*_k)$, so that the exponentiated intercept gives the overall SMR, accounting for the stratum-specific mortality rates. The model can be extended to estimate stratum-specific SMRs by inclusion of explanatory variables in the Poisson regression model [14][15][16]. For example, we obtained estimates of calendar-year specific SMRs from data grouped by strata using this model with calendar year included as an explanatory variable.
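To make the offset formulation concrete, the following Python sketch fits the intercept-only Poisson model by Newton-Raphson and returns the exponentiated intercept with a 95% Wald interval. It is an illustration under stated assumptions, not the authors' Stata implementation: the function name and argument layout are hypothetical, and the Wald interval on the log scale may differ from the variance estimator used in the paper. With only an intercept and the two offsets, the fitted value reduces to observed deaths divided by expected deaths.

```python
import math


def fit_overall_smr(deaths, person_years, ref_rates, tol=1e-12, max_iter=50):
    """Intercept-only Poisson regression with offsets log(t_k) and log(lambda*_k).

    deaths, person_years, ref_rates: equal-length sequences, one entry per stratum k.
    Returns (smr, lower, upper), where the bounds form a 95% Wald interval
    computed on the log scale.
    """
    # Expected deaths per stratum under the reference rates: lambda*_k * t_k
    expected = [r * t for r, t in zip(ref_rates, person_years)]
    total_d, total_e = sum(deaths), sum(expected)

    beta = 0.0  # log SMR, starting from SMR = 1
    for _ in range(max_iter):
        mu = math.exp(beta) * total_e   # sum of fitted means over strata
        step = (total_d - mu) / mu      # Newton step: score divided by information
        beta += step
        if abs(step) < tol:
            break

    se = 1 / math.sqrt(total_d)  # information at the maximum equals the observed deaths
    z = 1.959963984540054        # 97.5th percentile of the standard normal
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)
```

As a rough consistency check, with 78 729 observed deaths the standard error of the log SMR under this sketch is about 0.0036, which gives an interval of roughly 0.973 to 0.987 around an SMR of 0.980.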

SMR by follow-up period
For the full cohort of 1 million randomly sampled CPRD GOLD patients, time-since-entry, defined as the time from index date in years (Fig 1), was included in the estimation model, providing estimates of SMRs by follow-up period. When estimating SMRs by follow-up period f, the data are split additionally by this third timescale, time-since-entry. The inclusion of age groups (18-59, 60-69, 70-79, 80-89, 90-99) as an interaction with follow-up period allowed SMRs to vary by age group over follow-up period.
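Splitting person-time by time-since-entry can be sketched as follows. This is a simplified illustration, not the authors' procedure: each person carries a single hypothetical reference rate, whereas in the analysis the reference rate varies by age, calendar year and sex and would require further splitting on those time scales. When follow-up band is the only categorical term in the Poisson model with offsets, the fitted SMR for each band equals observed deaths divided by expected deaths in that band, which is what the sketch computes.

```python
from collections import defaultdict


def split_follow_up(length_years, died, band_width=1.0):
    """Split one person's follow-up into bands of time-since-entry.

    Returns a list of (band_index, years_at_risk, death_in_band); a death is
    assigned to the band in which follow-up ends.
    """
    pieces, t, band = [], 0.0, 0
    while t < length_years:
        upper = min(length_years, (band + 1) * band_width)
        is_last = upper == length_years
        pieces.append((band, upper - t, int(died and is_last)))
        t, band = upper, band + 1
    return pieces


def smr_by_band(people, band_width=1.0):
    """SMR per follow-up band.

    people: iterable of (length_years, died, ref_rate), where ref_rate is a
    single hypothetical reference mortality rate per person-year.
    """
    observed = defaultdict(int)
    expected = defaultdict(float)
    for length, died, ref_rate in people:
        for band, years, death in split_follow_up(length, died, band_width):
            observed[band] += death
            expected[band] += ref_rate * years
    return {b: observed[b] / expected[b] for b in sorted(expected)}
```

A person with zero follow-up contributes no person-time in this sketch, so a death on the index date itself would need separate handling.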
All analysis and modelling procedures were performed in Stata 16. This research was approved by the Independent Scientific Advisory Committee (ISAC) for Medicines and Healthcare products Regulatory Agency Database Research (19_253RA). Generic ethical approval for observational research using the CPRD with approval from ISAC has been granted by a Health Research Authority Research Ethics Committee. Individual patient consent is not required.

Results
Over the almost 19-year period (1st January 2000 to 31st December 2018), there were 78 729 deaths (7.9%) in the full CPRD random sample cohort (n = 1 000 000), Table 1. Requiring a lookback window W ≥ w (w = 1, 2, 5, 10) reduced the cohort size relative to the full cohort (w = 0). The sample size decreased to n = 876 048 for the sub-cohort with at least 1 year of lookback, n = 771 175 for W ≥ 2 years, n = 568 114 for W ≥ 5 years and n = 370 780 for W ≥ 10 years. There was some evidence of geographical variation between the sub-cohorts, with the relative contribution of patients and practices from the London region decreasing for sub-cohorts with longer lookback windows. The patient pre-index CPRD history (defined as index date minus start date, in years) was on average 1.84 years for those with no lookback requirement, with a minimum of zero years of CPRD history, while some subjects had over 18 years of history prior to their start of at-risk follow-up. The mean pre-index CPRD history increased as the lookback window requirement increased. Gender ratio, mean age at start date and mean age at death date remained consistent over all sub-cohorts, whilst mean age at index date and at end date increased with lookback, reflecting an older population in the sub-cohorts. Despite this, the percentage of deaths in follow-up remained relatively consistent over sub-cohorts, while follow-up decreased from over 6.5 million person-years to 2.2 million person-years from zero to ten years of lookback. The mean follow-up per individual remained constant at around 6 years.
The crude death rate remained relatively stable, increasing only slightly in the ten-year lookback sub-cohort. The large majority of subjects had no comorbidity at baseline across all sub-cohorts. The proportion with no comorbidity at baseline decreased as lookback increased, with all other comorbidity groups increasing as comorbidity burden rose due to an ageing population. In those with ten years of lookback the proportion with no comorbidity reduced to 88%, compared to 91% in the sub-cohort with five years of lookback. A small increase was also seen in the mean CCI score.
Practice registration history in CPRD for patients in the full CPRD random sample (n = 1 000 000), starting when a practice is deemed to provide up-to-standard data and ending at the date of last data collection, had a mean of 16.65 (SD = 7.03) years. The longest registration was 31.6 years, while the shortest was 68 days.

Lookback window and effect on SMR
The overall SMR for the 1 million CPRD random sample was 0.980 [95% confidence interval (CI) (0.973, 0.987)]. As suggested by the overall SMR, the cohort with no lookback window requirement (w = 0) had SMRs that tended to be just below one. Longer lookback windows were associated with lower SMRs. The requirement of at least a single year of lookback resulted in an SMR of 0.905 (0.898-0.912). Further increases in lookback revealed a trend of decreasing overall SMRs: for two years of lookback (W ≥ 2) an SMR of 0.881 (0.874-0.888), for five years (W ≥ 5) an SMR of 0.849 (0.841-0.857) and for ten years (W ≥ 10) an SMR of 0.837 (0.827-0.847).

Mortality by follow-up in CPRD
In the full cohort there was evidence of an initial high SMR in the first two years after entry (Fig 4 and Table in S1 File). After the second year of follow-up, mortality rates reverted to below national background rates. When considered across all follow-up periods, the mortality rate in the cohort was just below the mortality rate in the general population, overall SMR = 0.980 (0.973-0.987).

Mortality by follow-up and age group in CPRD
SMRs were estimated by follow-up and age group, Fig 5. This confirmed that the initial high SMR seen overall (Fig 4) was present in all age groups, yet the effect was lowest in the youngest age group (18-59). Older age groups had higher initial SMRs and lower SMRs in later follow-up, yet in all age groups the SMR fell below one after the third year of follow-up. This trend continued up to 19 years after study entry (index date).

Discussion
Overall, mortality rates in the unrestricted CPRD GOLD random sample population of 1 million patients are similar to mortality rates seen in the general English population. The inclusion of a lookback window requirement of even a single year resulted in a significantly lower mortality rate in the sub-cohort once accounting for age and sex when compared with the English population. This implies that a healthier population is being selected, creating a form of selection bias. The requirement of a lookback window may inadvertently remove high-risk patients, or simply result in the selection of a more "stable" patient population. Longer registration periods with a single primary care provider may additionally result in more medically vigilant and compliant patients, all indicative of a healthier patient subgroup.
The end date of a patient's follow-up, as in many EHR studies, represents a compound measure including data specific to an individual and data contributed by their registered GP practice. The end date utilised here is either the patient's date of transfer out (which can be for reasons of death), date of death, the date of last data collection from their GP practice or the administrative censoring date, whichever came earliest. As the lookback requirement increases, so does the proportion of patients' end dates defined by the date of last data collection from their registered GP practice. This form of censoring, though likely to be uninformative, should be examined and the impact of the selection of practices no longer contributing to CPRD considered. Similarly, the increase in lookback increases the number who reach administrative censoring, while the number of patients who transfer out of a registered GP practice decreases, emphasising the "stable" population narrative. This reasoning may, however, oversimplify the mechanisms at play and needs further investigation.
The complexity regarding the anonymity of CPRD data may be a driving factor in the high initial SMRs. Patients in CPRD represent unique lines of data. If a patient transfers out of their elected GP practice and into a new practice (for a multitude of reasons, such as at their request or due to a change of residential address), this results in the creation of a "new" patient record in CPRD on registration with their new primary care provider. Therefore, it is conceivable for CPRD to contain multiple patient records that are in fact the same individual. At present, utilising only CPRD as a data source, there is no mechanism to link these records together. It is theorised that the transfer out of patients from one GP practice and their subsequent death shortly after re-registration with a new GP practice may account for a portion of the high initial SMRs seen in the first two years of follow-up.
As a hypothetical example, consider an elderly patient who transfers out of their current longstanding GP practice and moves residence into assisted care housing, registers at the closest GP practice or a GP practice associated with the care home and then passes away 10 months after re-registration. Within the context of the data available, this would be seen as two individual records in CPRD, the first with a long CPRD record with no mortality event as the patient transferred out, and the second having a death within 10 months of registration. This hypothesis is partly supported by the finding that younger patients have lower initial SMRs than older patients do. Further investigation is needed to assess if subjects that are re-registering at a new GP practice (with previous CPRD registration history) are at a higher risk than new CPRD patients are.
A number of limitations have been identified in this research. It was performed on a random sample of patients from CPRD and so does not represent the entirety of CPRD GOLD. Additionally, the data were derived only from an English population. The generalisability of these results to CPRD Aurum, other geographical areas within the United Kingdom and other large-scale primary care EHRs is unknown. The lack of a full date of birth per patient, with only a birth year provided, could have a marginal effect on results, while the unavailability of a linkage mechanism between de-registered and re-registered patients is far more problematic. The size of the sample (1 million patients) is a strength, however, along with the use of a robust statistical model, in the form of Poisson regression, considering changes over calendar year and follow-up, modelled on multiple time scales (age and calendar year).

Conclusions
Regardless of the mechanism behind the selection effect or the high initial mortality rates when compared to the general population, the findings of reduced mortality rates with longer lookback windows and high initial mortality rates in CPRD are important and should be noted by all who use CPRD in the study of mortality. The use of these lookback periods is commonplace, and the implicit assumption that CPRD is representative of mortality in the general population must be carefully considered. If the lookback requirement is consistently applied to both the study population and the control group, then comparisons between groups may be internally valid. However, when the results of a study are to be generalised to the wider population, the representativeness of the CPRD cohort should be questioned. In addition, the higher rates of mortality compared to adjusted general population rates in the first two years after entry into CPRD also need to be considered when addressing research questions using CPRD.